Reproducibility

Steps of data analysis:

- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code

Probability

P(A u B) = P(A) + P(B) - P(A n B). The probability of the union of events A and B is the sum of each probability minus the probability of their intersection; for mutually exclusive events, P(A n B) = 0, so the union is simply the sum.

Random variable = the numerical outcome of an experiment (discrete or continuous)

- e.g. the flip of a coin (discrete random variable)
- e.g. the roll of a die (also a discrete random variable, because it's limited to 1-6)
- e.g. the BMI of a person is a continuous random variable

For a random variable, the probabilities must add up to 1, and each component must be greater than or equal to 0.

- Probability mass function: for discrete random variables
- Probability density function: for continuous random variables
    - In R, the probability of an outcome falling at or below a given value of a continuous distribution can be calculated with functions such as pbeta(). The p prefix on a distribution function means "probability" (cumulative probability); the q prefix means "quantile".
- Conditional probability: think of the probability of a lightning strike on a stormy vs. a sunny day.
    - P(A|B) = P(A n B)/P(B) is the conditional probability, read as the probability of A given condition B.
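These rules can be checked with a small base-R sketch using a fair six-sided die; the events A and B here are made up for illustration:

```r
## Sample space of a fair die; every outcome has probability 1/6.
omega <- 1:6
A <- c(2, 4, 6)   # event: roll is even
B <- c(4, 5, 6)   # event: roll is at least 4

p <- function(e) length(e) / length(omega)   # probability of an event

p(union(A, B))                     # P(A u B), directly: 4/6
p(A) + p(B) - p(intersect(A, B))   # same value via the addition rule
p(intersect(A, B)) / p(B)          # P(A|B): even, given the roll is >= 4
```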
            

The mean is the center of a distribution; the variance and standard deviation measure how spread out the distribution is.

## Beta distribution
pbeta(0.75, 2, 1) ## probability that a Beta(2, 1) random variable is <= 0.75
## [1] 0.5625
pbeta(c(.4, .5, .6), 2, 1) ## the same cumulative probability evaluated at 0.4, 0.5, and 0.6
## [1] 0.16 0.25 0.36
pnorm(70, mean = 80, sd = 10) ## probability of a value <= 70 given a set mean and standard deviation, assuming normally distributed data
## [1] 0.1586553
qnorm(0.95, mean=1100, sd = 75) ## returns the value of a normal distribution at a specified percentile (95%), given a known mean and st. dev.
## [1] 1223.364
pnorm(16, mean = 15, sd = 1) - pnorm(14, mean = 15, sd = 1) ## probability that an output falls between two known values, i.e. 14 and 16
## [1] 0.6826895
ppois(10, lambda = 15) ## Poisson-distributed data: probability of seeing 10 or fewer events, where lambda = 15 comes from 5 people per hour over 3 hours
## [1] 0.1184644
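The p- and q-prefixed functions are inverses of each other, which is a handy sanity check:

```r
## pnorm() maps a value to its cumulative probability; qnorm() maps the
## probability back to the value, so composing them recovers the input.
p <- pnorm(70, mean = 80, sd = 10)
qnorm(p, mean = 80, sd = 10)  ## recovers 70
```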

    library(UsingR)   ## provides the galton data
    library(ggplot2)
    g <- ggplot(galton, aes(x = child))
    g <- g + geom_histogram(fill = "salmon", 
      binwidth=1, aes(y = ..density..), colour = "black")
    g <- g + geom_density(size = 2)
    g <- g + geom_vline(xintercept = mean(galton$child), size = 2)
    g

Variance = the total sum of squared deviations from the mean, divided by n - 1 for a sample.
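In R, var() computes the sample variance this way; a quick check with a small made-up vector:

```r
## var() is the sum of squared deviations from the mean divided by n - 1,
## not the raw sum of squares itself.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
ss <- sum((x - mean(x))^2)  # total sum of squares about the mean
ss / (length(x) - 1)        # same value that var() returns
var(x)
```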

lambda <- 0.2 
nsim <- 1:1000 # Number of Simulations/rows
n <- 40 

Ematrix <- data.frame(x = sapply(nsim, function(x) {mean(rexp(n, lambda))})) ##one row per simulation: the mean of 40 exponentials
head(Ematrix)
Smean <- apply(Ematrix, 2, mean) ## calculating the simulated mean from the data frame above.
Smean
##       x 
## 5.00173
Tmean <- 1/lambda  ##calculating the theoretical mean given lambda
Tmean
## [1] 5
SSD <- sd(Ematrix$x)  ##The simulated standard deviation
SSD
## [1] 0.8158697
SVar <- var(Ematrix$x)  ##The simulated variance from the above matrix.
SVar
## [1] 0.6656433
TSD <- (1/lambda)/sqrt(n)  ##The theoretical standard deviation.
TSD
## [1] 0.7905694
TVar <- TSD^2   ##The theoretical calculation for the Variance
TVar
## [1] 0.625

t-test

(Xbar - mu0)/(s/sqrt(n)): in other words, the sample mean minus the hypothesized/test mean, divided by the standard error of the mean (the sample standard deviation divided by the square root of the sample size, n). This statistic follows a t-distribution with n-1 degrees of freedom. You can calculate a t-distribution quantile using qt(.95, 15), where .95 is the percentile (an alpha of 0.05) and 15 is n-1 in this example.
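The formula can be computed by hand and checked against t.test(); the data here are simulated for illustration:

```r
## Manual one-sample t statistic vs. the one reported by t.test().
set.seed(1)
x <- rnorm(20, mean = 5, sd = 2)   # simulated sample
mu0 <- 4                           # hypothesized mean

tstat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
tstat
t.test(x, mu = mu0)$statistic  ## same value, to machine precision
```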

library(UsingR); data(father.son)
t.test(father.son$sheight - father.son$fheight)  ##one-sample t-test on the paired differences: is the mean difference between son's and father's heights nonzero?
## 
##  One Sample t-test
## 
## data:  father.son$sheight - father.son$fheight
## t = 11.789, df = 1077, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.8310296 1.1629160
## sample estimates:
## mean of x 
## 0.9969728
library(datasets)
data(ToothGrowth)
head(ToothGrowth)
library(ggplot2)

plot0 <- ggplot(ToothGrowth, aes(supp, len, fill = supp))
plot1 <- plot0 + geom_dotplot(binaxis = "y") +
      facet_grid(. ~ dose) 
plot1
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

t.test(len ~ supp, data = ToothGrowth) ##testing sig difference between supp types and length
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333
test1 <- subset(ToothGrowth, dose %in% c(.5,1)) ##subsetting doses to make pairwise comparisons
t.test(len ~ dose, data = test1) ##testing sig differences between discrete doses and length
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means between group 0.5 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735
test2 <- subset(ToothGrowth, dose %in% c(1,2)) ##subsetting doses to make pairwise comparisons
t.test(len ~ dose, data = test2)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

t-test Confidence Interval

## Prompt for below: a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student's T confidence interval for the mean brain volume?
mn = 1100
s = 30
n = 9
round(mn + c(-1,1)*qt(.975, df = 8)*s/sqrt(n))
## [1] 1077 1123
## A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow-up - baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
n = 9
mn_dif = 2  ## magnitude of the mean difference
(y_d <- round(mn_dif*sqrt(n) / qt(.975, df = 8), 2))
## [1] 2.6
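As a check on that second answer, plugging the solved-for standard deviation back into the interval should put the upper endpoint at exactly 0:

```r
## Upper endpoint of the 95% t interval for a mean difference of -2,
## with the solved-for standard deviation (~2.6) and n = 9.
n <- 9
s <- 2 * sqrt(n) / qt(.975, df = n - 1)  # exact solution, approx 2.6
upper <- -2 + qt(.975, df = n - 1) * s / sqrt(n)
upper  ## effectively 0
```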

Two sample t-test

##Running a two-sample t-test
data(ChickWeight); library(reshape2)
wideCW <- dcast(ChickWeight, Diet + Chick ~ Time, value.var = "weight") ##reshapes the long-form dataset into wide form: one row per Diet/Chick combination, one column per Time, with weight as the value. 
names(wideCW)[-(1:2)] <- paste( "time", names(wideCW)[-(1:2)], sep= "") ##renames every column except the first two, pasting "time" before the existing column header. 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
wideCW <- mutate(wideCW, gain = time21-time0)  ##adds a column titled gain: the weight difference between time 21 and time 0. 

wideCW_14 <- subset(wideCW, Diet %in% c(1,4))
t.test(gain~Diet, paired = FALSE, var.equal = TRUE, data = wideCW_14)
## 
##  Two Sample t-test
## 
## data:  gain by Diet
## t = -2.7252, df = 23, p-value = 0.01207
## alternative hypothesis: true difference in means between group 1 and group 4 is not equal to 0
## 95 percent confidence interval:
##  -108.14679  -14.81154
## sample estimates:
## mean in group 1 mean in group 4 
##        136.1875        197.6667

P-Values

pt(2.5, 15, lower.tail = FALSE)  ##the probability of getting a t-statistic of at least 2.5 (the first argument), with df = 15 (i.e. n-1).
## [1] 0.0122529

Least Squares (Regression)

Sum[i=1 to n] of [Yi - (B0 + B1*Xi)]^2. The least-squares best fit minimizes the sum, over samples i = 1 to n, of the squared difference between Yi (the i'th sample's y-coordinate) and the fitted value B0 + B1*Xi. B0 = y-axis intercept, B1 = slope, Yi = y of sample i, Xi = x of sample i. Think of it as taking the distance between the sample's y-coordinate and the best-fit line, y - (ax + b), i.e. Yi - Yfit, and then squaring it.
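A quick sanity check that lm() really minimizes this criterion: perturbing either coefficient away from the fitted values can only increase the sum of squares (mtcars is used here just as convenient example data).

```r
## Sum-of-squares criterion for a line b0 + b1*x through (x, y).
x <- mtcars$wt; y <- mtcars$mpg
ss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

b <- coef(lm(y ~ x))                 # least-squares intercept and slope
ss(b[1], b[2]) < ss(b[1], b[2] + 0.01)  ## TRUE: nudging the slope hurts
ss(b[1], b[2]) < ss(b[1] + 0.01, b[2])  ## TRUE: so does nudging the intercept
```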

library(UsingR)
library(ggplot2)

freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+5), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq-5))
g <- g + scale_colour_gradient(low = "lightblue", high="white")                    
g

###_____________________Linear Least Squares calculated
y <- galton$child
x <- galton$parent
beta1 <- cor(y, x) *  sd(y) / sd(x)  ##The slope: the correlation between y and x times the st. dev. of y over the st. dev. of x
beta0 <- mean(y) - beta1 * mean(x) ##calculates the y-intercept of the best fit. 
rbind(c(beta0, beta1), coef(lm(y ~ x))) ##lm stands for linear model. coef takes the output and gives us coefficients. REMEMBER - y is the outcome, and x is the predictor. 
##      (Intercept)         x
## [1,]    23.94153 0.6462906
## [2,]    23.94153 0.6462906
##This example gives you the same output, because one is the manual calculation, while the other is done automatically using lm. 

beta1 <- cor(y, x) *  sd(x) / sd(y)  ##if we swapped the outcome and the predictor 
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))  ##the slope changes when regressing x on y, but the manual calculation still matches lm
##      (Intercept)         y
## [1,]    46.13535 0.3256475
## [2,]    46.13535 0.3256475
yc <- y - mean(y)
xc <- x - mean(x)
beta1 <- sum(yc * xc) / sum(xc ^ 2)
c(beta1, coef(lm(y ~ x))[2]) ##returns the same slope
##                   x 
## 0.6462906 0.6462906
###simplified 
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 20), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+5), show.legend = FALSE)
g <- g + geom_point(aes(colour=freq, size = freq-10))
g <- g + scale_colour_gradient(low = "lightblue", high="white")  
g <- g + geom_smooth(method="lm", col ="orange", formula=y~x) ##for fitting the linear best fit to a plot
g

Linear Regression for Prediction

library(UsingR)
data(diamond)
library(ggplot2)
g = ggplot(diamond, aes(x = carat, y = price))
g = g + xlab("Mass (carats)")
g = g + ylab("Price (SIN $)")
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 4, colour = "black", alpha=0.5)
g = g + geom_point(size = 2, colour = "blue", alpha=0.2)
g
## `geom_smooth()` using formula 'y ~ x'

fit <- lm(price~ carat, data = diamond)  ## prints coefficient Bo and B1, slope and intercept
coef(fit)
## (Intercept)       carat 
##   -259.6259   3721.0249
summary(fit) ##gives you the complete statistic
## 
## Call:
## lm(formula = price ~ carat, data = diamond)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## carat        3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16
fit2 <- lm(price ~ I(carat - mean(carat)), data=diamond) ##how to mean-center your predictor, carat. The I() function lets you put a calculation inside the formula. 
coef(fit2)
##            (Intercept) I(carat - mean(carat)) 
##               500.0833              3721.0249
fit3 <- lm(price ~ I(carat * 10), data=diamond) ##rescales the predictor so the slope is the price change per 1/10 carat, not per whole carat. 
coef(fit3)
##   (Intercept) I(carat * 10) 
##     -259.6259      372.1025
newx <- c(0.16, 0.27, 0.34)  ## example carat sizes of diamonds whose price we want to predict. 
predict(fit, newdata = data.frame(carat = newx))  ##uses the original fit; newdata supplies the new carat values, and the column name must match the predictor in the model.  
##         1         2         3 
##  335.7381  745.0508 1005.5225

Residuals

data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x)
e <- resid(fit)  ##easiest way to calculate residuals
yhat <- predict(fit)
max(abs(e -(y - yhat)))
## [1] 9.485746e-13
max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))  ##will return the same as resid()
## [1] 9.485746e-13
plot(x, e,  
     xlab = "Mass (carats)", 
     ylab = "Residuals (SIN $)", 
     bg = "lightblue", 
     col = "black", cex = 2, pch = 21,frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)

##_______________Example of non-linear data and residuals
x = runif(100, -3, 3); y = x + sin(x) + rnorm(100, sd = .2); 
library(ggplot2)
library(RColorBrewer)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g
## `geom_smooth()` using formula 'y ~ x'

##_______________
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), ##How to plot residuals
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g

##____________ Calculating the residual and plotting for diamond data
diamond$e <- resid(lm(price ~ carat, data = diamond))  ##adding a new column into data.frame for residuals
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g

Predictions

fit <- lm(mpg~wt, data= mtcars) ## creating a linear model of the mtcars dataset, where weight is the predictor
predict(fit, wt=3000) ##note: wt=3000 is not a valid argument and is silently ignored, because predict() expects a newdata data frame; this call therefore just returns the fitted values for every car in the dataset.
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##           23.282611           21.919770           24.885952           20.102650 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##           18.900144           18.793255           18.205363           20.236262 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##           20.450041           18.900144           18.900144           15.533127 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##           17.350247           17.083024            9.226650            8.296712 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            8.718926           25.527289           28.653805           27.478021 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##           24.111004           18.472586           18.926866           16.762355 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##           16.735633           26.943574           25.847957           29.198941 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##           20.343151           22.480940           18.205363           22.427495
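To actually predict for a single 3000 lb car, pass a newdata data frame whose column name matches the predictor, keeping in mind that wt in mtcars is measured in units of 1000 lb:

```r
## Correct usage: predict mpg for one new weight value (3000 lb = wt of 3).
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = 3))  ## about 21.25 mpg
```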

Multivariate Regressions

Shapiro Wilks test for Normal distribution

data <- rnorm(100)

#perform Shapiro-Wilk test for normality
shapiro.test(data)  ## we reject normality when p < 0.05; otherwise we fail to reject, and treat the data as consistent with a normal distribution.
## 
##  Shapiro-Wilk normality test
## 
## data:  data
## W = 0.99441, p-value = 0.9567
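Conversely, clearly skewed data should fail the test; a small sketch using simulated exponential draws:

```r
## Exponential data are strongly right-skewed, so Shapiro-Wilk should
## reject normality with a very small p-value.
set.seed(42)
skewed <- rexp(100)
shapiro.test(skewed)$p.value < 0.05  ## TRUE: reject normality
```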

Machine Learning

question - input data - features - algorithm - parameters - evaluation

Prediction has accuracy tradeoffs

In-sample error: the error rate you get on the same data set you used to build your predictor. Sometimes called re-substitution error.

Out-of-sample error: the error rate you get on a new data set, sometimes called generalization error.

- Out-of-sample error is what we should care about.
- In-sample error is always less than out-of-sample error; the reason is over-fitting.

True positive vs. false positive, true negative vs. false negative:

- Sensitivity: the probability that the test predicted positive, given that the outcome really was positive.
- Specificity: the probability that the test predicted negative, given that the outcome really was negative.
- Receiver Operating Characteristic (ROC): the area under the curve (AUC) measures goodness of fit of the model/prediction to the data/outcome. An AUC of 0.5 is random guessing; an AUC > 0.8 is a good model.

Cross-validation:

1. Use the training set.
2. Split the training set into a test set and a smaller training set.
3. Build a model on the smaller training set.
4. Evaluate the model on the test set.
5. Repeat and average the estimated errors.

k-fold cross-validation is a popular technique.
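The sensitivity and specificity definitions can be sketched from a 2x2 confusion matrix; all counts here are made up for illustration:

```r
## Hypothetical counts: 100 truly positive and 100 truly negative cases.
TP <- 90; FN <- 10   # truly positive: test positive / test negative
TN <- 80; FP <- 20   # truly negative: test negative / test positive

sensitivity <- TP / (TP + FN)  # P(test + | truly +)
specificity <- TN / (TN + FP)  # P(test - | truly -)
c(sensitivity = sensitivity, specificity = specificity)  ## 0.9 and 0.8
```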

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
data(spam)
head(spam)
plot(density(spam$your[spam$type=="nonspam"]),
     col="blue",main="",xlab="Frequency of 'your'")
lines(density(spam$your[spam$type=="spam"]),col="red")
abline(v=0.5,col="black")

##We want to find a value C, which is the cutoff point between spam frequencies, and non-spam frequencies of the word "your". If frequency is above C, predict spam...

prediction <- ifelse(spam$your > 0.5,"spam","nonspam")
table(prediction,spam$type)/length(spam$type)  
##           
## prediction   nonspam      spam
##    nonspam 0.4590306 0.1017170
##    spam    0.1469246 0.2923278
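Overall accuracy of this cutoff rule is the sum of the diagonal of that table (correctly classified nonspam plus correctly classified spam); a sketch using the proportions printed above:

```r
## Proportions copied from the prediction table above.
tab <- matrix(c(0.4590306, 0.1469246,   # true nonspam: predicted nonspam / spam
                0.1017170, 0.2923278),  # true spam:    predicted nonspam / spam
              nrow = 2,
              dimnames = list(prediction = c("nonspam", "spam"),
                              truth = c("nonspam", "spam")))
sum(diag(tab))  ## accuracy of roughly 0.75
```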

###Caret for machine learning package

obj class    Package   Predict function syntax
lda          MASS      predict(obj) (no options needed)
glm          stats     predict(obj, type = "response")
gbm          gbm       predict(obj, type = "response", n.trees)
mda          mda       predict(obj, type = "posterior")
rpart        rpart     predict(obj, type = "prob")
Weka         RWeka     predict(obj, type = "probability")
LogitBoost   caTools   predict(obj, type = "raw", nIter)

** Different types of predictors have different object class syntax requirements.

library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(kernlab) ## to get spam dataset

data(spam)
inTrain <- createDataPartition(y=spam$type, p= 0.75, list = FALSE) ## partitions the data, stratified by spam type, where p is the training proportion. 
training <- spam[inTrain,]  ##rows selected for the training set
testing <- spam[-inTrain,]  ##the remaining rows, held out as the test set
dim(training)  ##shows the dimensions of the dataframe
## [1] 3451   58
set.seed(11111)
modelFit <- train(type ~., data= training, method = "glm")  ##training a model, specifically a glm model, where we are comparing column type (spam vs. not spam) with everything ".", and using the subsetted data for training. 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9213441  0.8350731
modelFit$finalModel  ##returns the fitted glm coefficients for every predictor; larger positive coefficients push a message toward being classified as spam. 
## 
## Call:  NULL
## 
## Coefficients:
##       (Intercept)               make            address                all  
##        -1.453e+00         -2.097e-01         -1.617e-01          1.525e-01  
##             num3d                our               over             remove  
##         2.070e+00          5.582e-01          6.049e-01          2.195e+00  
##          internet              order               mail            receive  
##         6.044e-01          8.605e-01          1.361e-01         -4.118e-01  
##              will             people             report          addresses  
##        -1.976e-01          6.807e-02          6.363e-02          1.047e+00  
##              free           business              email                you  
##         1.236e+00          9.854e-01         -4.050e-02          1.147e-01  
##            credit               your               font             num000  
##         8.911e-01          2.028e-01          2.563e-01          1.770e+00  
##             money                 hp                hpl             george  
##         2.007e-01         -2.089e+00         -1.015e+00         -7.784e+00  
##            num650                lab               labs             telnet  
##         3.942e-01         -1.989e+00         -4.778e-01         -4.912e+00  
##            num857               data             num415              num85  
##         1.525e+00         -1.247e+00          1.243e+00         -2.569e+00  
##        technology            num1999              parts                 pm  
##         1.265e+00          2.640e-01          1.668e+00         -8.041e-01  
##            direct                 cs            meeting           original  
##        -3.565e-01         -5.148e+02         -2.555e+00         -1.211e+00  
##           project                 re                edu              table  
##        -2.319e+00         -1.069e+00         -1.573e+00         -2.919e+00  
##        conference      charSemicolon   charRoundbracket  charSquarebracket  
##        -4.012e+00         -1.247e+00         -2.209e-01         -6.101e-01  
##   charExclamation         charDollar           charHash         capitalAve  
##         2.468e-01          6.951e+00          1.509e+00         -1.323e-02  
##       capitalLong       capitalTotal  
##         9.523e-03          7.914e-04  
## 
## Degrees of Freedom: 3450 Total (i.e. Null);  3393 Residual
## Null Deviance:       4628 
## Residual Deviance: 1323  AIC: 1439
predictions <- predict(modelFit, newdata = testing)  ##how we can predict on new samples, here using the subset testing data.frame. 
head(predictions)  ##preview the predicted labels for the test set
## [1] spam    nonspam nonspam spam    spam    spam   
## Levels: nonspam spam
##  [982] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
##  [991] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1000] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1009] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1018] nonspam nonspam nonspam spam    nonspam nonspam nonspam nonspam spam   
## [1027] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1036] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1045] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1054] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1063] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1072] nonspam nonspam nonspam nonspam nonspam spam    nonspam nonspam nonspam
## [1081] nonspam spam    nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1090] nonspam nonspam nonspam nonspam nonspam spam    nonspam nonspam nonspam
## [1099] nonspam nonspam spam    nonspam nonspam nonspam nonspam nonspam nonspam
## [1108] nonspam spam    spam    nonspam nonspam nonspam nonspam nonspam nonspam
## [1117] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1126] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1135] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1144] nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## Levels: nonspam spam
confusionMatrix(predictions, testing$type)  ##the confusionMatrix() function summarizes how well the predictions match the true type (spam vs. nonspam)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     666   51
##    spam         31  402
##                                           
##                Accuracy : 0.9287          
##                  95% CI : (0.9123, 0.9429)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8495          
##                                           
##  Mcnemar's Test P-Value : 0.03589         
##                                           
##             Sensitivity : 0.9555          
##             Specificity : 0.8874          
##          Pos Pred Value : 0.9289          
##          Neg Pred Value : 0.9284          
##              Prevalence : 0.6061          
##          Detection Rate : 0.5791          
##    Detection Prevalence : 0.6235          
##       Balanced Accuracy : 0.9215          
##                                           
##        'Positive' Class : nonspam         
## 
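The headline statistics in the confusionMatrix output can be reproduced by hand from the 2x2 counts above; a base-R check (here nonspam is the "positive" class, as reported):

```r
## Rebuild the 2x2 confusion matrix from the counts printed above
cm <- matrix(c(666, 31, 51, 402), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))
accuracy    <- sum(diag(cm)) / sum(cm)                          ## (666 + 402) / 1150
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])  ## 666 / 697
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])           ## 402 / 453
round(c(accuracy, sensitivity, specificity), 4)  ## matches 0.9287, 0.9555, 0.8874
```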

Data Slicing with caret

library(caret)
library(kernlab) ## to get spam dataset

data(spam)
inTrain <- createDataPartition(y=spam$type, p= 0.75, list = FALSE) ## data is partitioned by spam type, where p is the training proportion. 
training <- spam[inTrain,]  ##rows selected by the partition form the training set
testing <- spam[-inTrain,]  ##all remaining rows form the test set
dim(training)
## [1] 3451   58
set.seed(11111)
folds <- createFolds(y= spam$type, k=10, list= TRUE, returnTrain=TRUE)  ##k is the number of folds; spam$type is the column being stratified on. returnTrain=TRUE returns training-set indices, FALSE returns test-set indices.
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##   4141   4140   4141   4142   4142   4141   4140   4140   4141   4141
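What createFolds() is doing can be sketched in base R (ignoring caret's per-class stratification): shuffle the row indices and deal them into k groups. The counts n = 4601 and k = 10 match the spam example above:

```r
set.seed(11111)
n <- 4601                                     ## number of rows in the spam data
k <- 10
fold_id <- sample(rep(1:k, length.out = n))   ## random fold label for each row
folds   <- split(seq_len(n), fold_id)         ## test-set indices for each fold
sapply(folds, length)                         ## each fold holds roughly n/k rows
```

With returnTrain=TRUE, the training indices for fold i would correspond to setdiff(seq_len(n), folds[[i]]), which is why each set returned above contains roughly 9/10 of the rows (~4141).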

Plotting Predictors + Hmisc for cutting data.frames

library(ISLR)
library(ggplot2)
library(caret)

data(Wage)  ## note the capital W: data(wage) fails with the warning "data set 'wage' not found"
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 
head(Wage, n=20)
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training); dim(testing)  ##returns the number of rows and columns
## [1] 2102   11
## [1] 898  11
featurePlot(x=training[, c("age", "education", "jobclass")], ##a plotting function that comes with caret, and we're taking a subset of the data, three columns in this instance.
            y= training$wage,  ##this is your output of interest
            plot="pairs")

qplot(age, wage, data=training) ##here we see two distinct groups

qplot(age, wage, colour= jobclass, data= training)

qp <- qplot(age, wage, color= education, data=training)
qp + geom_smooth(method = 'lm', formula= y~x)

library(Hmisc)  ##good library for cutting data.frames into pieces. 
cutWage <- cut2(training$wage, g=3) ##the cut2() function cuts the $wage column into g=3 groups of roughly equal size.
table(cutWage)
## cutWage
## [ 20.9, 93) [ 93.0,119) [118.9,318] 
##         707         718         677
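If Hmisc is unavailable, base R's cut() at quantile breaks is a rough stand-in for cut2(..., g = 3) (cut2 handles tied boundary values a little differently); a toy sketch with made-up wage-like values:

```r
x <- c(20, 50, 80, 95, 110, 120, 200, 300, 60)          ## synthetic values
breaks <- quantile(x, probs = seq(0, 1, length.out = 4)) ## tertile boundaries
grp <- cut(x, breaks = breaks, include.lowest = TRUE)    ## 3 roughly equal groups
table(grp)
```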
p1 <- qplot(cutWage, age, data=training, fill=cutWage, geom=c("boxplot"))
p1          

t1 <- table(cutWage, training$jobclass)  ##great way to make quick tables
t1
##              
## cutWage       1. Industrial 2. Information
##   [ 20.9, 93)           455            252
##   [ 93.0,119)           362            356
##   [118.9,318]           276            401
prop.table(t1,1)  ##proportion table; margin 1 gives proportions within each row, 2 would give proportions within each column.
##              
## cutWage       1. Industrial 2. Information
##   [ 20.9, 93)     0.6435644      0.3564356
##   [ 93.0,119)     0.5041783      0.4958217
##   [118.9,318]     0.4076809      0.5923191
qplot(wage, color=education, data= training, geom="density")

Pre-processing predictor variables

library(caret)
library(kernlab) ## to get spam dataset
library(RANN)

data(spam)
inTrain <- createDataPartition(y=spam$type, p= 0.75, list = FALSE) ## data is partitioned by spam type, where p is the training proportion. 
training <- spam[inTrain,]  ##rows selected by the partition form the training set
testing <- spam[-inTrain,]  ##all remaining rows form the test set
dim(training)
## [1] 3451   58
hist(training$capitalAve, main="", xlab="ave.capital run length")

mean(training$capitalAve)
## [1] 5.577148
sd(training$capitalAve) ## the standard deviation is significantly larger than the mean, a sign of strong skew, which indicates that we need some pre-processing.
## [1] 35.92989
trainCapAve <- training$capitalAve
trainCapAveS <- (trainCapAve - mean(trainCapAve))/sd(trainCapAve) ## A way of standardizing the data
mean(trainCapAveS)
## [1] -1.162129e-17
sd(trainCapAveS)
## [1] 1
##To standardize the test set, we must reuse the mean and standard deviation computed on the TRAINING set.
testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve - mean(trainCapAve))/sd(trainCapAve) 
mean(testCapAveS)
## [1] -0.04294102
sd(testCapAveS)
## [1] 0.3437084
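The rule above in miniature, with synthetic skewed data standing in for capitalAve: the test set is scaled with the training mean and sd, so it lands only approximately at mean 0 and sd 1:

```r
set.seed(1)
train_x <- rexp(200, rate = 1/5)     ## skewed synthetic data, like capitalAve
test_x  <- rexp(100, rate = 1/5)
m <- mean(train_x); s <- sd(train_x) ## statistics learned on training data only
train_s <- (train_x - m) / s         ## exactly mean 0, sd 1
test_s  <- (test_x  - m) / s         ## close to, but not exactly, mean 0 and sd 1
c(mean(train_s), sd(train_s), mean(test_s), sd(test_s))
```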
##Alternative: pass a preProcess argument to the train() function
modelFit <- train(type~., data= training, preProcess= c("center", "scale"), method="glm")  ##This centers and scales all of the predictor variables.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## (these two warnings repeat across the 25 bootstrap resamples; output truncated)
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9167869  0.8257754
## Prediction models tend to fail when predictors contain NAs. We'll use k-nearest-neighbors imputation (knnImpute) to fill them in.
# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1  ## randomly flag ~5% of rows to receive an NA
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute")  ##excluding column 58, the outcome (type)
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
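knnImpute requires caret; as a minimal base-R stand-in, median imputation illustrates the same pattern (learn the fill value from the observed data, then apply it to the missing entries). All values here are synthetic:

```r
set.seed(13563)
x_na <- rexp(500, rate = 1/5)                              ## synthetic predictor
x_na[rbinom(length(x_na), size = 1, prob = 0.05) == 1] <- NA  ## knock out ~5%
fill  <- median(x_na, na.rm = TRUE)      ## fill value learned from observed data
x_imp <- ifelse(is.na(x_na), fill, x_na) ## replace only the missing entries
sum(is.na(x_imp))                        ## 0: every NA has been filled
```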

Covariate (predictor) creation

 ##Covariate (feature) creation means computing new variables that summarize the raw data; with the spam dataset, each covariate summarizes an entire e-mail.

library(ISLR); library(caret); data(Wage);
inTrain <- createDataPartition(y=Wage$wage,
                              p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]

##turning qualitative variables into quantitative ones
table(training$jobclass)
## 
##  1. Industrial 2. Information 
##           1067           1035
dummies <- dummyVars(wage ~ jobclass,data=training)  ##the dummyVars() function turns categorical values such as "Industrial" or "Information" into indicator (0/1) columns. Outcome is wage, and jobclass is the predictor.
head(predict(dummies,newdata=training)) 
##        jobclass.1. Industrial jobclass.2. Information
## 161300                      1                       0
## 155159                      0                       1
## 376662                      0                       1
## 450601                      1                       0
## 377954                      0                       1
## 228963                      0                       1
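Base R's model.matrix() builds the same kind of indicator columns that dummyVars produces; a toy sketch with a hypothetical two-level factor:

```r
jobclass <- factor(c("Industrial", "Information", "Information", "Industrial"))
mm <- model.matrix(~ 0 + jobclass)  ## ~ 0 + ... drops the intercept, giving one 0/1 column per level
mm
```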
##__________________
##A way to throw out less-informative predictors, i.e. those with near-zero variance.
nsv <- nearZeroVar(training,saveMetrics=TRUE)
nsv  ##in this example we can throw out region, which is flagged as near-zero variance
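The two metrics nearZeroVar reports (freqRatio and percentUnique) are simple to compute by hand; a sketch with a synthetic, nearly constant predictor:

```r
x <- c(rep(0, 98), 1, 2)                ## a nearly constant predictor
tab <- sort(table(x), decreasing = TRUE)
freqRatio     <- unname(tab[1] / tab[2])          ## most common count / second most common: 98
percentUnique <- 100 * length(unique(x)) / length(x)  ## 3 distinct values in 100 obs: 3
c(freqRatio, percentUnique)             ## both well past the default cutoffs: near-zero variance
```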

Predicting with Regression

##Using the old faithful dataset
library(caret);data(faithful); set.seed(333)
inTrain <- createDataPartition(y=faithful$waiting,
                              p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
head(trainFaith)
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration") ##plotting the training data. 

lm1 <- lm(eruptions ~ waiting, data=trainFaith)
summary(lm1)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13375 -0.36778  0.06064  0.36578  0.96057 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.648629   0.226603  -7.275 2.55e-11 ***
## waiting      0.072211   0.003136  23.026  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4941 on 135 degrees of freedom
## Multiple R-squared:  0.7971, Adjusted R-squared:  0.7956 
## F-statistic: 530.2 on 1 and 135 DF,  p-value: < 2.2e-16
##output: eruption duration = 0.0722 * (waiting time) - 1.6486

## To predict a new value, we can automate this by:
coef(lm1)[1] + coef(lm1)[2]*80  ## coef(lm1)[1] returns the intercept, and [2] returns the slope. We are predicting with a wait time of 80
## (Intercept) 
##    4.128276
newdata <- data.frame(waiting=80)
predict(lm1,newdata)  ## a shortcut so we don't have to manually combine the coefficients.
##        1 
## 4.128276
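The coefficient arithmetic and predict() agree exactly; a self-contained check on synthetic data shaped like the faithful example:

```r
set.seed(333)
w <- runif(50, 43, 96)                      ## synthetic waiting times
e <- 0.07 * w - 1.6 + rnorm(50, sd = 0.4)   ## synthetic eruption durations
fit <- lm(e ~ w)
by_hand <- coef(fit)[1] + coef(fit)[2] * 80           ## intercept + slope * 80
by_pred <- predict(fit, newdata = data.frame(w = 80)) ## same value via predict()
c(by_hand, by_pred)
```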
##We can now use the fitted model to predict on the TEST set
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
ord <- order(testFaith$waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")

##________________
##We can use caret to do the same, much faster
modFit <- train(eruptions ~ waiting,data=trainFaith,method="lm")  ##eruptions is the outcome, waiting is the predictor; the model is fit on the training set with linear regression.
summary(modFit$finalModel)  ##How we get final model output
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.13375 -0.36778  0.06064  0.36578  0.96057 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.648629   0.226603  -7.275 2.55e-11 ***
## waiting      0.072211   0.003136  23.026  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4941 on 135 degrees of freedom
## Multiple R-squared:  0.7971, Adjusted R-squared:  0.7956 
## F-statistic: 530.2 on 1 and 135 DF,  p-value: < 2.2e-16

Predicting with Multiple Covariates (Regression)

library(ISLR); library(ggplot2); library(caret);
data(Wage); Wage <- subset(Wage,select=-c(logwage))  ##removing logwage, since it is just a log transform of the outcome (wage) we are trying to predict
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins        wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.: 85.38  
##                                      Median :104.92  
##                                      Mean   :111.70  
##                                      3rd Qu.:128.68  
##                                      Max.   :318.34  
## 
inTrain <- createDataPartition(y=Wage$wage,##subsetting into test and train
                              p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training); dim(testing)
## [1] 2102   10
## [1] 898  10
featurePlot(x=training[,c("age","education","jobclass")],
            y = training$wage,
            plot="pairs")

##________________

modFit<- train(wage ~ age + jobclass + education,
               method = "lm",data=training)
finMod <- modFit$finalModel
print(modFit)
## Linear Regression 
## 
## 2102 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   35.56759  0.2589245  24.87554
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
plot(finMod,1,pch=19,cex=0.5,col="#00000010")  ##residuals vs. fitted; ideally the residuals scatter evenly around a horizontal line at zero

qplot(finMod$fitted,finMod$residuals,colour=race,data=training)  ##fitted vs. residuals, colored by a variable left out of the model (race), to spot unexplained trends.
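The RMSE and MAE that caret reports from resampling are just root-mean-squared error and mean absolute error; computed by hand on toy numbers:

```r
actual <- c(100, 120, 90, 150)              ## toy observed wages
pred   <- c(110, 115, 95, 140)              ## toy predictions
rmse <- sqrt(mean((actual - pred)^2))       ## root-mean-squared error
mae  <- mean(abs(actual - pred))            ## mean absolute error
c(rmse, mae)
```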